This project can be found within the following GitHub repository link.

0. Sources loading

First all needed libraries are loaded. This includes data.table, DT and lubridate to work with datasets (that also include dates), ggplot2, ggthemes, GGally, grid, gridExtra, plotly, leaflet and corrplot for data exploration/visualization purposes and several other machine learning libraries used for modeling.

All self-programed functions such as transforming or splitting a dataset are saved within the R notbook called functions.

source('notebooks/libraries.R')
source('notebooks/functions.R')

The work is divided into two R markdowns. The first one only focusses on data exploration and preparation and the second one on modeling. This is due to the fact that each markdown takes some while to be created and with this division the whole data exploration does not need to be computed again while modeling.

1. Data sets

1.1 Data loading

Two data sets are provided. Both cover house sales within King County, USA and each row represents a single house sale between May 2014 and May 2015. While the train data set includes the price for each sale the test data set does not. Thus we are provided with a “blind” test data set. The train data set will be used to train and tune the machine learning model, while the test data set will be used to make price predictions based on the provided data and its features.

raw_train_data<-fread('data/house_price_train.csv', stringsAsFactors = F)
raw_test_data<-fread('data/house_price_test.csv', stringsAsFactors = F)

1.2 Data description

Consequently the train data set includes 21 columns while the test data set has only 20 columns. While the train data has 17,277 house sales, the test data has 4,320 house sales. Both data sets actually come from a complete data set that can be found on Kaggle under following link. However, this data set was probably then randomly splitted into train and test data using a 80:20 split and then for the test data set the prices were removed. Thus both data sets share the same 19 house features and each sale is identfied with a numeric id. Most of the features are directly related to the house properties such as number of bedrooms, squarefeet, waterfront, floors etc. In addition, we have information of the building year of the house and the lastest renovation (if there was any). There are also features that try to capture the overall condition of the house such as grade and condition. However, these two variables are likely to capture the same information and thus one might be redundant. Furthermore location data is provided as well as how often the place was visited. There are two variables that are most likely related to the average lot size and also average living size of the 15 closest neighbors (there is no clear explaination what these two variables really mean but this explaination makes the most sense). Overall, the dataset covers the most important house features that are also usually used on house sale websites. From the following link an overview and description of the variables was obtained.

str(raw_train_data)
## Classes 'data.table' and 'data.frame':   17277 obs. of  21 variables:
##  $ id           :integer64 9183703376 464000600 2224079050 6163901283 6392003810 7974200948 2426059124 2115510300 ... 
##  $ date         : chr  "5/13/2014" "8/27/2014" "7/18/2014" "1/30/2015" ...
##  $ price        : num  225000 641250 810000 330000 530000 ...
##  $ bedrooms     : int  3 3 4 4 4 4 4 3 4 3 ...
##  $ bathrooms    : num  1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
##  $ sqft_living  : int  1250 2220 3980 1890 1814 3120 4160 1440 2250 2540 ...
##  $ sqft_lot     : int  7500 2550 209523 7540 5000 5086 47480 10500 6840 9520 ...
##  $ floors       : num  1 3 2 1 1 2 2 1 2 1 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 2 2 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 4 4 3 3 3 3 3 ...
##  $ grade        : int  7 10 9 7 7 9 10 8 9 8 ...
##  $ sqft_above   : int  1250 2220 3980 1890 944 2480 4160 1130 2250 1500 ...
##  $ sqft_basement: int  0 0 0 0 870 640 0 310 0 1040 ...
##  $ yr_built     : int  1967 1990 2006 1967 1951 2008 1995 1983 1987 1959 ...
##  $ yr_renovated : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98030 98117 98024 98155 98115 98115 98072 98023 98058 98115 ...
##  $ lat          : num  47.4 47.7 47.6 47.8 47.7 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1260 2200 2220 1890 1290 1880 3400 1510 2480 1870 ...
##  $ sqft_lot15   : int  7563 5610 65775 8515 5000 5092 40428 8125 7386 6800 ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(raw_test_data)
## Classes 'data.table' and 'data.frame':   4320 obs. of  20 variables:
##  $ id           :integer64 6414100192 6054650070 16000397 2524049179 8562750320 7589200193 9547205180 1432701230 ... 
##  $ date         : chr  "12/9/2014" "10/7/2014" "12/5/2014" "8/26/2014" ...
##  $ bedrooms     : int  3 3 2 3 3 3 3 3 3 5 ...
##  $ bathrooms    : num  2.25 1.75 1 2.75 2.5 1 2.5 1 2.5 2.5 ...
##  $ sqft_living  : int  2570 1370 1200 3050 2320 1090 2300 1280 3160 3150 ...
##  $ sqft_lot     : int  7242 9680 9850 44867 3980 3000 3060 9656 13603 9134 ...
##  $ floors       : num  2 1 1 1 2 1.5 1.5 1 2 1 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 0 0 4 0 0 0 0 0 0 ...
##  $ condition    : int  3 4 4 3 3 4 3 4 3 4 ...
##  $ grade        : int  7 7 7 9 8 8 8 6 8 8 ...
##  $ sqft_above   : int  2170 1370 1200 2330 2320 1090 1510 920 3160 1640 ...
##  $ sqft_basement: int  400 0 0 720 0 0 790 360 0 1510 ...
##  $ yr_built     : int  1951 1977 1921 1968 2003 1929 1930 1959 2003 1966 ...
##  $ yr_renovated : int  1991 0 0 0 0 0 2002 0 0 0 ...
##  $ zipcode      : int  98125 98074 98002 98040 98027 98117 98115 98058 98019 98056 ...
##  $ lat          : num  47.7 47.6 47.3 47.5 47.5 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1690 1370 1060 4110 2580 1570 1590 1340 3050 1990 ...
##  $ sqft_lot15   : int  7639 10208 5095 20336 3980 5080 3264 8808 9232 9133 ...
##  - attr(*, ".internal.selfref")=<externalptr>
datatable(raw_train_data)
NA detection

Overall, we see that the data does not have any missing values, which is an imporant aspect in terms of data quality.

sum(is.na(raw_train_data))
## [1] 0
sum(is.na(raw_test_data))
## [1] 0

1.3 Data preparation

By looking at the data types of the several features, we can see right away that some of them are not in the correct format. Thus with the self-programmed function prep_data several datatypes for both data sets are changed:

  • date: chr to date (for later feature creation)
  • zip code: int to factor

The function is specifically programmed for these data sets and are applied to both train and test. The transformed data sets are then saved within a new variable.

There are several features that appear to be rather factor variables than numbers or integers. They represent discrete variables and should be turned into factors. Before turning them into factors, first data visualization is performed to actually see if we find linear relationship between those features and the target variable. If this is the case we can leave them as numeric or integer variable, in any other case they should be turned into factors.

Train data
prep_train_data <- prep_data(raw_train_data)
str(prep_train_data)
## Classes 'data.table' and 'data.frame':   17277 obs. of  21 variables:
##  $ id           :integer64 9183703376 464000600 2224079050 6163901283 6392003810 7974200948 2426059124 2115510300 ... 
##  $ date         : Date, format: "2014-05-13" "2014-08-27" ...
##  $ price        : num  225000 641250 810000 330000 530000 ...
##  $ bedrooms     : int  3 3 4 4 4 4 4 3 4 3 ...
##  $ bathrooms    : num  1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
##  $ sqft_living  : int  1250 2220 3980 1890 1814 3120 4160 1440 2250 2540 ...
##  $ sqft_lot     : int  7500 2550 209523 7540 5000 5086 47480 10500 6840 9520 ...
##  $ floors       : num  1 3 2 1 1 2 2 1 2 1 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 2 2 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 4 4 3 3 3 3 3 ...
##  $ grade        : int  7 10 9 7 7 9 10 8 9 8 ...
##  $ sqft_above   : int  1250 2220 3980 1890 944 2480 4160 1130 2250 1500 ...
##  $ sqft_basement: int  0 0 0 0 870 640 0 310 0 1040 ...
##  $ yr_built     : int  1967 1990 2006 1967 1951 2008 1995 1983 1987 1959 ...
##  $ yr_renovated : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : Factor w/ 70 levels "98001","98002",..: 19 52 15 63 50 50 37 14 33 50 ...
##  $ lat          : num  47.4 47.7 47.6 47.8 47.7 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1260 2200 2220 1890 1290 1880 3400 1510 2480 1870 ...
##  $ sqft_lot15   : int  7563 5610 65775 8515 5000 5092 40428 8125 7386 6800 ...
##  - attr(*, ".internal.selfref")=<externalptr>
Test data
prep_test_data <- prep_data(raw_test_data)
str(prep_test_data)
## Classes 'data.table' and 'data.frame':   4320 obs. of  20 variables:
##  $ id           :integer64 6414100192 6054650070 16000397 2524049179 8562750320 7589200193 9547205180 1432701230 ... 
##  $ date         : Date, format: "2014-12-09" "2014-10-07" ...
##  $ bedrooms     : int  3 3 2 3 3 3 3 3 3 5 ...
##  $ bathrooms    : num  2.25 1.75 1 2.75 2.5 1 2.5 1 2.5 2.5 ...
##  $ sqft_living  : int  2570 1370 1200 3050 2320 1090 2300 1280 3160 3150 ...
##  $ sqft_lot     : int  7242 9680 9850 44867 3980 3000 3060 9656 13603 9134 ...
##  $ floors       : num  2 1 1 1 2 1.5 1.5 1 2 1 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 0 0 4 0 0 0 0 0 0 ...
##  $ condition    : int  3 4 4 3 3 4 3 4 3 4 ...
##  $ grade        : int  7 7 7 9 8 8 8 6 8 8 ...
##  $ sqft_above   : int  2170 1370 1200 2330 2320 1090 1510 920 3160 1640 ...
##  $ sqft_basement: int  400 0 0 720 0 0 790 360 0 1510 ...
##  $ yr_built     : int  1951 1977 1921 1968 2003 1929 1930 1959 2003 1966 ...
##  $ yr_renovated : int  1991 0 0 0 0 0 2002 0 0 0 ...
##  $ zipcode      : Factor w/ 70 levels "98001","98002",..: 56 38 2 26 16 52 50 33 12 32 ...
##  $ lat          : num  47.7 47.6 47.3 47.5 47.5 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1690 1370 1060 4110 2580 1570 1590 1340 3050 1990 ...
##  $ sqft_lot15   : int  7639 10208 5095 20336 3980 5080 3264 8808 9232 9133 ...
##  - attr(*, ".internal.selfref")=<externalptr>

1.4 Data visualization

One of the most important things in machine learning is to get familiar with the data set. Data exploration and in particular data visualization is very helpful in this regard. This part will only look at the train data set since the model is trained on it and the whole purpose is to create a model that has good generalization power. Thus all the information in the test data set is not relevant at this point.

Distribution

In order to have a general feeling for the train dataset, a table is created that shows the general distribution of all numeric variables. Especially the price quantiles can help later in the visualization when price bins are needed.

df <- as.data.frame(prep_train_data)
df <- df[ , !(names(df) %in% c('id', 'zipcode', 'lat', 'long', 'date'))]
distribution <- as.data.frame(t(sapply(df, quantile)))
distribution$Mean <- sapply(df, mean)
distribution$SD <- sapply(df, sd)
datatable(round(distribution, 2))

1.4.1 Price distribution

Since price is the target variable the first step is too look at the distribution of this continuous variable.

price_dist <- ggplot(prep_train_data, aes(x = price)) +    
              geom_histogram(alpha = 0.8, fill='greenyellow') +
              labs(x= 'Price',y = 'Count', title = 'Price distribution') + 
              theme_bw()+  
              theme(text = element_text(face = "bold"),
                    panel.border = element_blank(), 
                    panel.grid.major = element_blank(),
                    panel.grid.minor = element_blank(),
                    axis.line = element_blank(),
                    axis.ticks = element_blank(),
                    legend.position="none")


ggplotly(price_dist)

Since there are some houses that sold for very high prices (>2 million) the distribution chart does not reveal much due to the high skewness to the right. Thus the distribution chart is now cut off at 2 mio in order to gain a better picture of the price distribution

price_dist_cut <- ggplot(prep_train_data, aes(x = price)) +    
              geom_histogram(alpha = 0.8, fill='greenyellow') +
              scale_x_continuous(limits=c(0,2e6)) +
              labs(x= 'Price',y = 'Count', title = 'Cutted price distribution') + 
              theme_bw()+  
              theme(text = element_text(face = "bold"),
                    panel.border = element_blank(), 
                    panel.grid.major = element_blank(),
                    panel.grid.minor = element_blank(),
                    axis.line = element_blank(),
                    axis.ticks = element_blank(),
                    legend.position="none")



ggplotly(price_dist_cut)

The distribution is still skewed to the right but now we see that most houses sold between 200k and 1mio. By setting the cut-off point at 2mio, 157 house sales (0.09%) were not included due to their high selling prices. Nevertheless it is important to determine why some of the houses sold for such a large amount.

Another option is to use the logarithm of the target variable and check if it is normally distributed. It might make sense to work with a log-transformed target and then transform the predtictions back to their original scale using the exponential.

price_data <- log(prep_train_data$price)
price_data <- as.data.frame(price_data)
names(price_data)[1] <- "price_log"


price_log_dist <- ggplot(price_data, aes(x = price_log)) +    
              geom_histogram(alpha = 0.8, fill='greenyellow') +
              labs(x= 'Price',y = 'Count', title = 'Log price distribution') + 
              theme_bw()+  
              theme(text = element_text(face = "bold"),
                    panel.border = element_blank(), 
                    panel.grid.major = element_blank(),
                    panel.grid.minor = element_blank(),
                    axis.line = element_blank(),
                    axis.ticks = element_blank(),
                    legend.position="none")


ggplotly(price_log_dist)

As we can see the log of prices is actually quiet close to a normal distribution. It might make sense to actually use the log of the target for a linear model and then transform back the predictions on the initial scale by using the exponential.

1.4.2 Discrete variables

Since there are several integer variables that are discrete rather than continuous variables, it makes sense to look at them in a separate way than continuous variables. First all the names of discrete variables are stored within a variable. Then a new dataframe is created that only stores the discrete variables. Since these variables stored discrete values they are transformed into factores and afterwards are melted.

discVar <- c("bedrooms", "bathrooms", "floors", "waterfront", "view", "condition", "grade")

df_disc <- prep_train_data[, ..discVar]
df_disc <- sapply(df_disc, as.factor)
df_disc <- as.data.frame(melt(df_disc))
df_disc$value <- factor(df_disc$value, levels=sort(as.numeric(levels(df_disc$value))), ordered=TRUE)

options(repr.plot.width = 24, repr.plot.height = 8)

disc_dist <- ggplot(df_disc, aes(value)) +
      geom_bar(aes(fill = Var2)) + 
      scale_y_continuous(breaks = scales::pretty_breaks(n = 5)) +
      scale_x_discrete(expand = c(0,0)) +
      facet_wrap(~Var2, scales = "free", nrow = 3) +
      scale_fill_tableau() +
      ggtitle("Count of each discrete variable") +
      labs(fill = "", x = "", y = "") +
      theme_minimal() +
      theme(text = element_text(face = "plain"),
            legend.position = "none",
            axis.text.x = element_text(angle = 90, size = 7),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
           plot.title = element_text(hjust = 0.5)) 

disc_dist

There are several features that have a highly unbalanced distribution such as views (many never viewed = 0), waterfront (most not on waterfront) and renovated (most not renovated.). Something else we see right away by looking at the histograms is that it seems that for bedroom and bathroom there are some questionable values. While there are houses without a bathrooms (= 0), another house appears to have 33 bedrooms. Thus it makes sense to further look at outliers in this case.

options(repr.plot.width = 10, repr.plot.height = 10)

disc_box <- ggplot(df_disc, aes(Var2, as.numeric(value))) +
                geom_boxplot(aes(fill = Var2)) +
                scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
                scale_x_discrete(expand = c(0,0)) +
                facet_wrap(~Var2, scales = "free", ncol = 1) +
                scale_fill_tableau() +
                ggtitle("Distribution of each discrete variable") +
                labs(fill = "", x = "", y = "") +
                coord_flip() +
                theme_light() +
                theme(text = element_text(face = "bold"),
                      legend.position = "none",
                      axis.text.x = element_blank(),
                      panel.grid.major = element_blank(),
                      panel.grid.minor = element_blank(),
                      plot.title = element_text(hjust = 0.5),
                      strip.background = element_blank(),
                      strip.text.x = element_blank())

disc_box

We see again that especially bedroom and bathroom are variables with several outliers. Thus we will take a closer look at it in the data cleaning part.

1.4.3 continuous variables

Similar to the discrete variables, we also have a look at the several continuous variables. Six of them are all related to squarefeet measures the last one is grade. In order to plot them properly a new melted dateframe is created.

contVar <- c("sqft_living", "sqft_lot", "sqft_above", "sqft_basement", "sqft_living15", "sqft_lot15")
df_cont <- prep_train_data[,..contVar]
df_cont <- as.data.frame(melt(df_cont))

We see that all of these continuous variables are skewed to the right. Especially the one once related to lot sizes showing that there are many with very little lot sizes and very few with large lot sizes. For sqft of the basement we see similar to the variable of having a basement or not that a large proportion of the houses have no basement and thus a sqft size of 0 for basement. We will take a closer look in the following at the year built variable.

options(repr.plot.width = 12, repr.plot.height = 6)

cont_dist <- ggplot(df_cont, aes(value)) +
        geom_density(aes(fill = variable)) +
        facet_wrap(~variable, scales = "free", nrow = 2) +
        labs(x = "", y = "", fill = "") +
        theme_minimal() +
        scale_fill_tableau() +
        ggtitle("Distribution of each continuous variable") +
        theme(text = element_text(face = "bold"),
              legend.position = "none",
              axis.text.x = element_text(angle = 45),
              panel.grid.major = element_blank(),
              panel.grid.minor = element_blank(),
             plot.title = element_text(hjust = 0.5))

cont_dist

Since we know we have highly skewed variables it make sense to take a look at their boxplots. However, using non-normalized boxplots does not reveal much due to the large number of outliers

cont_box <- ggplot(df_cont, aes(variable, value)) +
          geom_boxplot(aes(fill = variable)) +
          coord_flip() +                                
          scale_fill_tableau() +
          labs(x = "", y = "") +
          theme_minimal() +
          theme(text = element_text(face = "bold"),
                legend.position = "none",
                panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
               plot.title = element_text(hjust = 0.5),
               axis.text.x = element_blank())                                
                            
                      
cont_box

Thus the variables are normalized using the max-min normalizer. Now the boxplots reveal a bit more but still many variables have a high number of outliers.

df_cont_norm <- prep_train_data[,..contVar]
df_cont_norm <- as.data.frame(apply(df_cont_norm, 2,function(x)((x - min(x))/(max(x)-min(x)))))
df_cont_norm <- as.data.frame(melt(df_cont_norm))

cont_box_norm <- ggplot(df_cont_norm, aes(variable, value)) +
              geom_boxplot(aes(fill = variable)) +
              coord_flip() +                                
              scale_fill_tableau() +
              labs(x = "", y = "") +
              theme_minimal() +
              theme(text = element_text(face = "bold"),
                    legend.position = "none",
                    panel.grid.major = element_blank(),
                    panel.grid.minor = element_blank(),
                   plot.title = element_text(hjust = 0.5),
                   axis.text.x = element_blank())                                
                            
                      
cont_box_norm

1.4.4 Building year

Overall, we see that the number of houses built per year seems to rise. However especially during economic downturns such as the 1930s and the years following the financial crisis in 2008. However, it is likely that this does not reflect the actual building pattern in this area since in specific years some houses from a specific time could be torn down or a particularly popular. However, there might be actual price trends that could be related to the building year, since some century building style could be more popular than others.

options(repr.plot.width = 10, repr.plot.height = 5)

year_plot <- ggplot(prep_train_data, aes(yr_built)) +
        geom_bar(fill = "coral4") +
        scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
        scale_x_continuous(breaks = scales::pretty_breaks(n = 10), expand = c(0,0)) +
        ggtitle("Houses built per year") +
        theme_minimal() +
        theme(text = element_text(face = "bold"),
             plot.title = element_text(hjust = 0.5))

ggplotly(year_plot)

While a simple scatterplot does not help to acutally detect any pattern, a smoothing average line actually reveals that prices for houses built between 1940 and 1980 do not seem to sell at high prices on average. While there seem to be tweo houses built before 1940 that sold for very high prices (>7mio), most of the high priced houses were built after 1990.

options(repr.plot.width = 10, repr.plot.height = 5)

y1 <- ggplot(prep_train_data, aes(yr_built, price)) +
        geom_point(colour = "greenyellow") +
        ggtitle("Prices throughout construction years") +
        scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
        scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
        theme_minimal() +
        theme(text = element_text(face = "bold"),
             axis.text.x = element_blank(),
             axis.title.x = element_blank())

y2 <- ggplot(prep_train_data, aes(yr_built, price)) +
        geom_smooth(se = FALSE, colour = "greenyellow") +
        ggtitle("Average prices throughout construction years") +
        scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
        scale_y_continuous(breaks = scales::pretty_breaks(n = 8)) +
        theme_minimal() +
        theme(text = element_text(face = "bold"))

output1 <- grid.arrange(y1, y2)

1.4.5 Location data

In order to determine if the location of a house actually effects its price all houses were assigned to five different price bins. While 300k (25% quantile), 450k (50% quantile) and 600k (75% quantile) are based on the price distribution within the data set, another cut was chosen at 2mio based on the highly skewed price histogram. While less expensive houses are grey and the most expensive houses appear in black. Thus creating a good contrast. The other the price bins become more red the more expensive they are.

We see overall especially houses close to the water are very expensive and that several neighborhood show many houses within the same price bin. Thus including the zip code and location data within the machine learning model might be beneficial. Especially when using a random forest this might work out well.

Clustering based on the location could be performed here as well, but as a first step zip code should reflect a similar level of granularity here.

location_data <- data.frame(prep_train_data)
location_data$PriceBin<-cut(location_data$price, c(0,300000,450000,650000,2000000,9000000))

center_lon = median(location_data$long,na.rm = TRUE)
center_lat = median(location_data$lat,na.rm = TRUE)

factpal <- colorFactor(c("#BDBDBD","#58FA82", "#F4FA58", "#FAAC58", "#FA5882", "#1C1C1C"), 
                       location_data$PriceBin)



leaflet(location_data) %>% addProviderTiles("Esri.OceanBasemap") %>%
  addCircles(lng = ~long, lat = ~lat, 
             color = ~factpal(PriceBin))  %>%
  # controls
  setView(lng=center_lon, lat=center_lat,zoom = 12) %>%
  
  addLegend("bottomright", pal = factpal, values = ~PriceBin,
            title = "House Price Distribution",
            opacity = 1)

1.4.6 Correlation analysis

We can observe that sqft_living and sqft_lot are highly correlated with sqft_living15 and sqft_lot15. Apparently these values are related to updates and renovations in 2015.

There are some variables with very high correlation to the price:

  • sqft_living = 0.7
  • grade = 0.67
  • sqft_above = 0.61 (however highly correlated with sqft_living = 0.88)
  • sqft_living15 = 0.59 (however highly correlated with sqft_living = 0.76)
  • bathrooms = 0.53 (however highly correlated with sqft_living = 0.76)

We will need to take a closer look at the residuals of our model later on in order to check the effect of multicoliniarity and then decide if it is necessary to further exclude high correlated features.

corr_data <- as.data.frame(prep_train_data)
corr_data <- corr_data[ , !(names(corr_data) %in% c('id', 'zipcode', 'date'))]
CorrelationResults = cor(corr_data)

corrplot(CorrelationResults, method = "color", outline = T, cl.pos = 'n', rect.col = "black",  tl.col = "indianred4", addCoef.col = "black", number.digits = 2, number.cex = 0.60, tl.cex = 0.7, cl.cex = 1, col = colorRampPalette(c("green4","white","red"))(100))

As we said before it might be that several of our discrete variables do not show a linear relationship to the price and thus should be turned into factore. Thus the relationship of the discrete as well as the continuous variables is checked in the following.

Discrete variable relationship to the price

Most of the discrete variables appear to have no linear relationship with the target variable and thus can be turned into factor variables. However, number of bathrooms and grade seem to have a linear relatonship and thus should stay numeric variables. Furthermore, the variable condition, month and year show very little correlation of less than 0.05 should be excluded for the modeling. In the following data cleaning part the data type transformations will be done.

options(repr.plot.width = 18, repr.plot.height = 18)

corr_data2 <- data.frame(prep_train_data)

corr_disc <- ggpairs(corr_data2[,c("price", discVar)]) +
  theme_minimal() +
  theme(text = element_text(face = "bold"),
        axis.text.x = element_text(angle = 90),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
corr_disc

continuous variable relationship to the price

Also for the continuous variables we that that most of them show some kind of linear relation ship to the target. However, the variables sqft_lot, sqft_lot15 and house age do not show a good relationship and very low correlation values to the target. Thus it might make sense to exclude them from the model since their correlation is less than 0.1.

In regard to house age and consquently year_built we see that the correlation is low and no apparent relationship can be seen. However, a thing to consider here is to create bins based on the building year decade since we saw before a drop in average house prices for a specific period in time.

options(repr.plot.width = 18, repr.plot.height = 18)

corr_cont <- ggpairs(corr_data2[,c("price", contVar)]) +
  theme_minimal() +
  theme(text = element_text(face = "bold"),
        axis.text.x = element_text(angle = 90),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
corr_cont

1.5 Data cleaning

1.5.1 Questionable data entries

We have seen before that there were several data points with unusual data. For example, the house with 33 and the other with 11 bedrooms. Both do not seem to be correct and thus are excluded.

Eventhough it seemed in the histograms that there are houses with 0 bathrooms. However, there is none with 0 bathrooms but several with less than 1 (e.g. 0.5 or 0.75). This could actually be based on the limited things within a bathroom (e.g. no shower). These values might be correct and thus the rows are not excluded.

show <- prep_train_data[which(prep_train_data$bedrooms > 10),]
datatable(show)
prep_train_data <- prep_train_data[-which(prep_train_data$bedrooms > 10),]

show<- prep_train_data[which(prep_train_data$bathrooms < 1),]
datatable(show)

1.5.3 Data type

As said before it makes sense to turn several variables into factor features. The remaining numerical features that are still saved as integer are then turned into numeric features. The next step will be dummy encoding but we already save now a not encoded data set.

prep_train_data3 <- data_types(prep_train_data)
prep_test_data3 <- data_types(prep_test_data)

fwrite(prep_train_data3, 'data/data_train_ready.csv', row.names = F)
fwrite(prep_test_data3, 'data/data_test_ready.csv', row.names = F)

1.5.4 Dummy encoding of factor

Since most algorithms require dummy encoding we are transforming our factor variables into dummies. However, id and date should not be encoded. Thus we have to exclude them from dummy encoding and then get them back from the old data set. When performing dummy encoding though, it came apparent that the test and the train set have differences among their factor variables. The test set resulted in only 102 variables and the train set in 105 variables. Thus two encoded variables are probably missing in the test set. Thus the encoding needs to be performed on a combined data set.

Combine data sets
prep_train_data3$dataset <- "train"
prep_test_data3$dataset <- "test"
prep_test_data3$price <- NA

total_data <- rbind(prep_train_data3, prep_test_data3)
Dummy encode
total_encoded<-caret::dummyVars(formula= ~.- id -date , data = total_data, fullRank=F,sep = "_")
total_encoded<-data.table(predict(total_encoded, newdata = total_data))

names(total_encoded)<-gsub('-','_',names(total_encoded))

total_encoded$id <- total_data$id
total_encoded$date <- total_data$date

str(total_data)
## Classes 'data.table' and 'data.frame':   21595 obs. of  22 variables:
##  $ id           :integer64 9183703376 464000600 2224079050 6163901283 6392003810 7974200948 2426059124 2115510300 ... 
##  $ date         : Date, format: "2014-05-13" "2014-08-27" ...
##  $ price        : num  225000 641250 810000 330000 530000 ...
##  $ bedrooms     : Factor w/ 10 levels "1","2","3","4",..: 3 3 4 4 4 4 4 3 4 3 ...
##  $ bathrooms    : num  1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
##  $ sqft_living  : num  1250 2220 3980 1890 1814 ...
##  $ sqft_lot     : num  7500 2550 209523 7540 5000 ...
##  $ floors       : Factor w/ 6 levels "1","1.5","2",..: 1 5 3 1 1 3 3 1 3 1 ...
##  $ waterfront   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : Factor w/ 5 levels "0","1","2","3",..: 1 3 3 1 1 1 1 1 1 1 ...
##  $ condition    : num  3 3 3 4 4 3 3 3 3 3 ...
##  $ grade        : num  7 10 9 7 7 9 10 8 9 8 ...
##  $ sqft_above   : num  1250 2220 3980 1890 944 2480 4160 1130 2250 1500 ...
##  $ sqft_basement: num  0 0 0 0 870 640 0 310 0 1040 ...
##  $ yr_built     : num  1967 1990 2006 1967 1951 ...
##  $ yr_renovated : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : Factor w/ 70 levels "98001","98002",..: 19 52 15 63 50 50 37 14 33 50 ...
##  $ lat          : num  47.4 47.7 47.6 47.8 47.7 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: num  1260 2200 2220 1890 1290 1880 3400 1510 2480 1870 ...
##  $ sqft_lot15   : num  7563 5610 65775 8515 5000 ...
##  $ dataset      : chr  "train" "train" "train" "train" ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(total_encoded)
## Classes 'data.table' and 'data.frame':   21595 obs. of  110 variables:
##  $ price        : num  225000 641250 810000 330000 530000 ...
##  $ bedrooms_1   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bedrooms_2   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bedrooms_3   : num  1 1 0 0 0 0 0 1 0 1 ...
##  $ bedrooms_4   : num  0 0 1 1 1 1 1 0 1 0 ...
##  $ bedrooms_5   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bedrooms_6   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bedrooms_7   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bedrooms_8   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bedrooms_9   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bedrooms_10  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bathrooms    : num  1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
##  $ sqft_living  : num  1250 2220 3980 1890 1814 ...
##  $ sqft_lot     : num  7500 2550 209523 7540 5000 ...
##  $ floors_1     : num  1 0 0 1 1 0 0 1 0 1 ...
##  $ floors_1.5   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ floors_2     : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ floors_2.5   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ floors_3     : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ floors_3.5   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ waterfront   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ view_0       : num  1 0 0 1 1 1 1 1 1 1 ...
##  $ view_1       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ view_2       : num  0 1 1 0 0 0 0 0 0 0 ...
##  $ view_3       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ view_4       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : num  3 3 3 4 4 3 3 3 3 3 ...
##  $ grade        : num  7 10 9 7 7 9 10 8 9 8 ...
##  $ sqft_above   : num  1250 2220 3980 1890 944 2480 4160 1130 2250 1500 ...
##  $ sqft_basement: num  0 0 0 0 870 640 0 310 0 1040 ...
##  $ yr_built     : num  1967 1990 2006 1967 1951 ...
##  $ yr_renovated : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98001: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98002: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98003: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98004: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98005: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98006: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98007: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98008: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98010: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98011: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98014: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98019: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98022: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98023: num  0 0 0 0 0 0 0 1 0 0 ...
##  $ zipcode_98024: num  0 0 1 0 0 0 0 0 0 0 ...
##  $ zipcode_98027: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98028: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98029: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98030: num  1 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98031: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98032: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98033: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98034: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98038: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98039: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98040: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98042: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98045: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98052: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98053: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98055: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98056: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98058: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ zipcode_98059: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98065: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98070: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98072: num  0 0 0 0 0 0 1 0 0 0 ...
##  $ zipcode_98074: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98075: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98077: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98092: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98102: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98103: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98105: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98106: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98107: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98108: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98109: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98112: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98115: num  0 0 0 0 1 1 0 0 0 1 ...
##  $ zipcode_98116: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98117: num  0 1 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98118: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98119: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98122: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98125: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98126: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98133: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98136: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98144: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98146: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98148: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98155: num  0 0 0 1 0 0 0 0 0 0 ...
##  $ zipcode_98166: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98168: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98177: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode_98178: num  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]
##  - attr(*, ".internal.selfref")=<externalptr>
Split data sets and save in csv

At the end the data sets are split again based on the classification before on train and test. The train dataset has still 17,275 rows (two row deleted due to too many bedrooms) and the test 4,320 rows. Furthermore the sum of prices for the train data set before and the encoded one is the same.

data_train_ready <- subset(total_encoded, datasettrain==1)
data_test_ready <- subset(total_encoded, datasettrain==0)
data_test_ready$price <- NULL
data_test_ready$datasettrain <- NULL
data_train_ready$datasettrain <- NULL
data_test_ready$datasettest <- NULL
data_train_ready$datasettest <- NULL

sum(data_train_ready$price) == sum(prep_train_data$price)
## [1] TRUE
fwrite(data_train_ready, 'data/data_train_dummy.csv', row.names = F)
fwrite(data_test_ready, 'data/data_test_dummy.csv', row.names = F)